802.1aq Shortest Path Bridging (SPB) is a computer networking technology intended to simplify the creation and configuration of carrier, enterprise, and cloud networks, reducing the scope for human error while enabling multipath routing.[1][2][3] The technology provides logical Ethernet networks on native Ethernet infrastructures using a link state protocol to advertise both topology and logical network membership. Packets are encapsulated at the edge either in MAC-in-MAC (IEEE 802.1ah) or tagged (IEEE 802.1Q/IEEE 802.1ad) frames and transported only to other members of the logical network. Unicast and multicast are supported, and all routing is on symmetric shortest paths. Many equal cost shortest paths are supported.
In December 2011 Shortest path bridging (SPB) was evaluated by the JITC for deployment within the Department of Defense (DoD) because of the ease in integrated OA&M and interoperability with current protocols.[4]
802.1aq is the IEEE-sanctioned link state Ethernet control plane for all IEEE VLANs covered in IEEE 802.1Q.[5] SPBV (Shortest Path Bridging - VID) provides capability that is backwards compatible with spanning tree technologies. SPBM (Shortest Path Bridging – MAC, previously known as SPBB) provides additional values which capitalize on Provider Backbone Bridge (PBB) capabilities. SPB (the generic term for both) combines an Ethernet data path (either IEEE 802.1Q in the case of SPBV, or Provider Backbone Bridges (PBBs) IEEE 802.1ah in the case of SPBM) with an IS-IS link state control protocol running between Shortest Path bridges (NNI links). The link state protocol is used to discover and advertise the network topology and compute shortest path trees from all bridges in the SPB Region.
In SPBM, the Backbone MAC (B-MAC) addresses of the participating nodes and the service membership information for interfaces to non-participating devices (UNI ports) are distributed. Topology data is the input to a calculation engine which computes symmetric shortest path trees based on minimum cost from each participating node to all other participating nodes. In SPBV these trees provide a shortest path tree where individual MAC addresses can be learned and Group Address membership can be distributed. In SPBM the shortest path trees are then used to populate forwarding tables for each participating node's individual B-MAC addresses and for Group addresses; Group multicast trees are sub trees of the default shortest path tree formed by (Source, Group) pairing. Depending on the topology, several different equal cost multi path trees are possible, and SPB supports multiple algorithms per IS-IS instance.
In SPB, as with other link state based protocols, the computations are done in a distributed fashion. Each node computes the Ethernet compliant forwarding behavior independently, based on a normally synchronized common view of the network (at scales of about 1000 nodes or less) and the service attachment points (UNI ports). Ethernet filtering database (or forwarding) tables are populated locally, so that each node independently and deterministically implements its portion of the network forwarding behavior.
The two different flavors of data path give rise to two slightly different versions of this protocol. One (SPBM) is intended where complete isolation of many separate instances of client LANs and their associated device MAC addresses is desired, and it therefore uses a full encapsulation (MAC-in-MAC a.k.a IEEE 802.1ah). The other (SPBV) is intended where such isolation of client device MAC addresses is not necessary, and it reuses only the existing VLAN tag a.k.a IEEE 802.1Q on participating (NNI) links.
Chronologically SPBV came first, with the project originally being conceived to address scalability and convergence of MSTP.
At the time the specification of Provider Backbone bridging was progressing and it became apparent that leveraging both the PBB data plane and a link state control plane would significantly extend Ethernet's capabilities and applications. Provider Link State Bridging (PLSB) was a strawman proposal brought to the IEEE 802.1aq Shortest Path Bridging Working Group, in order to provide a concrete example of such a system. As IEEE 802.1aq standardisation has progressed, some of the detailed mechanisms proposed by PLSB have been replaced by functional equivalents, but all of the key concepts embodied in PLSB are being carried forward into the standard.
The two flavors (SPBV and SPBM) will be described separately although the differences are almost entirely in the data plane.
Shortest Path Bridging enables shortest path trees for VLAN bridges on all IEEE 802.1 data planes, and SPB is the term used in general. Recently there has been a lot of focus on SPBM because of its ability to control the new PBB data plane and leverage certain capabilities such as removing the need to do B-MAC learning and automatically creating individual (unicast) and group (multicast) trees. SPBV was actually the original project that endeavored to enable Ethernet VLANs to better utilize mesh networks.
A primary feature of Shortest Path bridging is the ability to use Link State IS-IS to learn network topology. In SPBV the mechanism used to identify the tree is to use a different Shortest Path VLAN ID (VID) for each source bridge. The IS-IS topology is leveraged both to allocate unique SPVIDs and to enable shortest path forwarding for individual and group addresses. Originally targeted for small low configuration networks SPB grew into a larger project encompassing the latest provider control plane for SPBV and harmonizing the concepts of Ethernet data plane. Proponents of SPB believe that Ethernet can leverage link state and maintain the attributes that have made Ethernet one of the most encompassing data plane transport technologies. When we refer to Ethernet it is the layer 2 frame format defined by IEEE 802.3 and IEEE 802.1. Ethernet VLAN bridging IEEE 802.1Q is the frame forwarding paradigm that fully supports higher level protocols such as IP.
SPB defines a shortest path Region which is the boundary of the shortest path topology and the rest of the VLAN topology (which may be any number of legacy bridges.) SPB operates by learning the SPB capable bridges and growing the Region to include the SPB capable bridges that have the same Base VID and MSTID configuration digest (Allocation of VIDs for SPB purposes).
SPBV builds shortest path trees that support Loop Prevention and optionally support loop mitigation on the SPVID. SPBV still allows learning of Ethernet MAC addresses, but it can distribute multicast addresses that can be used to prune the shortest path trees according to the multicast membership, either through MMRP or directly using IS-IS distribution of multicast membership.
SPBV builds shortest path trees but also interworks with legacy bridges running Rapid Spanning Tree Protocol and Multiple Spanning Tree Protocol. SPBV uses techniques from MSTP Regions to interwork with non-SPB regions behaving logically as a large distributed bridge as viewed from outside the region.
SPBV supports shortest path trees but SPBV also builds a spanning tree which is computed from the link state database and uses the Base VID. This means that SPBV can use this traditional spanning tree for computation of the Common and Internal Spanning Tree (CIST). The CIST is the default tree used to interwork with other legacy bridges. It also serves as a fall back spanning tree if there are configuration problems with SPBV.
SPBV has been designed to manage a moderate number of bridges. SPBV differs from SPBM in that MAC addresses are learned on all bridges that lie on the shortest path and a shared VLAN learning is used since destination MACs may be associated with multiple SPVIDs. SPBV learns all MACs it forwards even outside the SPBV region.
SPBM reuses the PBB data plane which does not require that the Backbone Core Bridges (BCB) learn encapsulated client addresses. At the edge of the network the C-MAC (client) addresses are learned. SPBM is very similar to PLSB using the same data and control planes but the format and contents of the control messages in PLSB are not compatible.
Individual MAC frames (unicast traffic) from an Ethernet attached device that are received at the SPBM edge are encapsulated in a PBB (mac-in-mac) IEEE 802.1ah header and then traverse the IEEE 802.1aq network unchanged until they are stripped of the encapsulation as they egress back to the non participating attached network at the far side of the participating network.
Ethernet destination addresses (from UNI port attached devices) are learned over the logical LAN, and frames are forwarded to the appropriate participating B-MAC address to reach the far end Ethernet destination. In this manner Ethernet MAC addresses are never looked up in the core of an IEEE 802.1aq network. When comparing SPBM to PBB, the behavior is almost identical to a PBB IEEE 802.1ah network. PBB does not specify how B-MAC addresses are learned, and PBB may use spanning tree to control the B-VLAN. In SPBM the main difference is that B-MAC addresses are distributed or computed in the control plane, eliminating the B-MAC learning in PBB. SPBM also ensures that the route followed is on the shortest path tree.
The forward and reverse paths used for unicast and multicast traffic in an IEEE 802.1aq network are symmetric. This symmetry permits the normal Ethernet Connectivity Fault Management (CFM) IEEE 802.1ag to operate unchanged for SPBV and SPBM and has desirable properties with respect to time distribution protocols such as IEEE 1588v2. Existing Ethernet loop prevention is also augmented by loop mitigation to provide fast data plane convergence.
Group Address and unknown destination individual frames are optimally transmitted to only members of the same Ethernet service. IEEE 802.1aq supports the creation of thousands of logical Ethernet services in the form of E-LINE, E-LAN or E-TREE constructs which are formed between non participating logical ports of the IEEE 802.1aq network. These group address packets are encapsulated with a PBB header which indicates the source participating address in the SA, while the DA indicates the locally significant group address this frame should be forwarded on and which source bridge originated the frame. The IEEE 802.1aq multicast forwarding tables are created based on computations such that every bridge which is on the shortest path between a pair of bridges which are members of the same service group will create proper FDB state to forward or replicate frames it receives to the members of that service group. Since the group address computations produce shortest path trees, there is only ever one copy of a multicast packet on any given link. Since only bridges on a shortest path between participating logical ports create FDB state, the multicast makes efficient use of network resources.
The actual group address forwarding operation is more or less identical to classical Ethernet: the B-DA+B-VID combination is looked up to find the egress set of next hops. The only difference compared with classical Ethernet is that reverse learning is disabled for participating Bridge B-MAC addresses and is replaced with an ingress check and discard (when the frame arrives on an incoming interface from an unexpected source). Learning is however implemented at the edges of the SPBM multicast tree to learn the B-MAC to MAC address relationship for correct individual frame encapsulation in the reverse direction (as packets arrive over the Interface).
Properly implemented, an IEEE 802.1aq network can support up to 1000 participating bridges and provide tens of thousands of layer 2 E-LAN services to Ethernet devices. This can be done by simply configuring the ports facing the Ethernet devices to indicate they are members of a given service. As new members come and go, the IS-IS protocol will advertise the I-SID membership changes and the computations will grow or shrink the trees in the participating node network as necessary to maintain the efficient multicast property for that service.
IEEE 802.1aq has the property that only the point of attachment of a service needs configuration when a new attachment point comes or goes. The trees produced by the computations will automatically be extended or pruned as necessary to maintain connectivity. In some existing implementations this property is used to automatically (as opposed to through configuration) add or remove attachment points for dual homed technologies such as rings to maintain optimum packet flow between a non participating ring protocol and the IEEE 802.1aq network by activating a secondary attachment point and deactivating a primary attachment point.
Both SPBV and SPBM inherit key benefits of link state routing:
Virtualisation is becoming an increasingly important aspect of a number of key applications, in both Carrier and Enterprise space, and SPBM, with its MAC-in-MAC datapath providing complete separation between Client and Server layers, is uniquely suitable for these.
"Data Centre virtualisation" articulates the desire to flexibly and efficiently harness available compute resources in a way that may rapidly be modified to respond to varying application demands, without the need to dedicate physical resources to a specific application. One aspect of this is server virtualisation. The other is connectivity virtualisation, because a physically distributed set of server resources must be attached to a single IP subnet, and modifiable in an operationally simple and robust way. SPBM delivers this; because of its client-server model, it offers a perfect emulation of a transparent Ethernet LAN segment, which is the IP subnet seen at Layer 3. A key component of how it does this is implementing VLANs with scoped multicast trees, which means no egress discard of broadcast/unknown traffic, a feature common to approaches that use a small number of shared trees, hence the network does not simply degrade with size as the percentage of frames discarded goes up. It also supports "single touch" provisioning, so that configuration is simple and robust; the port of a virtual server must simply be bound locally to the SPBM I-SID identifying the LAN segment, after which IS-IS for SPB floods this binding, and all nodes that need to install forwarding state to implement the LAN segment do so automatically.
The Carrier-space equivalent of this application is the delivery of Ethernet VPN services to Enterprises over common Carrier infrastructure. The required attributes are fundamentally the same: complete transparency for customer Ethernet services (both point-to-point and LAN), and complete isolation between one customer's traffic and that of all other customers. The multiple virtual LAN segment model provides this, and the single-touch provisioning model eases carrier operations. Furthermore, the MAC-in-MAC datapath allows the carrier to deploy the "best in class" Ethernet OAM suite (IEEE 802.1ag, etc.), entirely transparently and independently from any OAM which a customer may choose to run.
A further consequence of SPBM's transparency in both dataplane and control plane is that it provides a perfect, "no compromise" delivery of the complete MEF 6.1 service set. This includes not only E-LINE and E-LAN constructs, but also E-TREE (hub-and-spoke) connectivity. The latter is clearly very relevant to Enterprise customers of Carrier VPN services which have this network structure internally. It also provides the carrier with the toolkit to support geo-redundant broadband backhaul; in this application, many DSLAMs or other access equipment must be backhauled to multiple BNG sites, with application-determined binding of sessions to a BNG. However, DSLAMs must not be allowed to communicate with each other, because carriers then lose the ability to control peer-to-peer connectivity. MEF E-TREE does just this, and further provides an efficient multicast fabric for the distribution of IP-TV.
SPBM offers both the ideal multicast replication model, where packets are replicated only at fork points in the shortest path tree that connects members, and the less state intensive head end replication model, where in essence serial unicast packets are sent to all other members along the same shortest path tree. These two models are selected by specifying properties of the service at the edge which affect the transit node decisions on multicast state installation. This allows for a trade-off to be made between optimum transit replication points (with their larger state costs) vs. reduced core state (but much more traffic) of the head end replication model. These selections can be different for different members of the same ISID, allowing different trade-offs to be made for different members.
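The trade-off between the two replication models can be illustrated with a minimal Python sketch. The node names, the three source-to-receiver paths and the function names `head_end_load` and `transit_load` are hypothetical and chosen only for illustration; the sketch simply counts how many copies of one multicast frame cross each link under each model.

```python
from collections import Counter

def head_end_load(paths):
    """Head end replication: one unicast copy per receiver, sent along its path."""
    load = Counter()
    for path in paths:
        for link in zip(path, path[1:]):
            load[link] += 1
    return load

def transit_load(paths):
    """Transit replication: frames fork inside the tree, so each tree link carries one copy."""
    tree_links = {link for path in paths for link in zip(path, path[1:])}
    return Counter(dict.fromkeys(tree_links, 1))

# Hypothetical shortest paths from one source to three receivers of a service.
paths = [
    ["src", "a", "b", "r1"],
    ["src", "a", "r2"],
    ["src", "c", "r3"],
]

head = head_end_load(paths)
transit = transit_load(paths)
print("head end copies on src->a:", head[("src", "a")])        # shared link carries 2 copies
print("transit copies on src->a: ", transit[("src", "a")])     # shared link carries 1 copy
print("total transmissions:", sum(head.values()), "vs", sum(transit.values()))
```

In this toy tree the shared first hop carries two copies under head end replication but only one under transit replication, at the price of the transit nodes holding per-service multicast state.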
Figure 5 below is a quick way to understand what SPBM is doing on the scale of the entire network. Figure 5 shows how a 7 member E-LAN is created from the edge membership information and the deterministic distributed calculation of per source, per service trees with transit replication. Head end replication is not shown as it is trivial and simply uses the existing unicast FIBs to forward copies serially to the known other receivers.
Failure recovery is as per normal IS-IS with the link failure being advertised and new computations being performed, resulting in new FDB tables. Since no Ethernet addresses are advertised or known by this protocol, there is no re-learning required by the SPBM core and its learned encapsulations are unaffected by a transit node or link failure.
Fast link failure detection may be performed using IEEE 802.1ag Continuity Check Messages (CCMs) which test link status and report a failure to the IS-IS protocol. This allows much faster failure detection than is possible using the IS-IS hello message loss mechanisms.
Both SPBV and SPBM inherit the rapid convergence of a link state control plane. A special attribute of SPBM is its ability to rebuild multicast trees in a similar time to unicast convergence, because it substitutes computation for signaling. When an SPBM bridge has performed the computations on a topology database, it knows whether it is on the shortest path between a root and one or more leaves of the SPT and can install state accordingly. Convergence is not gated by incremental discovery of a bridge’s place on a multicast tree by the use of separate signaling transactions. However, SPBM on a node does not operate completely independently of its peers, and enforces agreement on the current network topology with its peers. This very efficient mechanism uses exchange of a single digest of link state covering the entire network view, and does not need agreement on each path to each root individually. The result is that the volume of messaging exchanged to converge the network is in proportion to the incremental change in topology and not the number of multicast trees in the network. A simple link event that may change many trees is communicated by signaling the link event only; the consequent tree construction is performed by local computation at each node. The addition of a single service access point to a service instance involves only the announcement of the I-SID, regardless of the number of trees. Similarly the removal of a bridge, which might involve the rebuilding of hundreds to thousands of trees, is signaled only with a few link state updates.
Commercial offerings will likely offer SPB over multi-chassis LAG. In this environment multiple switch chassis appear as a single switch to the SPB control plane, and multiple links between pairs of chassis appear as an aggregate link. In this context a single link or node failure is not seen by the control plane and is handled locally, resulting in sub-50 ms recovery times.
802.1aq builds on all existing Ethernet OA&M. Since 802.1aq ensures that its unicast and multicast packets for a given VLAN follow the same forward and reverse path and use completely standard 802 encapsulations, all of the methods of 802.1ag and Y.1731 operate unchanged on an 802.1aq network.
See IEEE 802.1ag and ITU-recommendation Y.1731 (external link below).
Sixteen ECMT paths are initially defined; however, many more are possible. ECMT in an IEEE 802.1aq network is more predictable than with IP or MPLS because of symmetry between the forward and reverse paths. The choice as to which ECMT path will be used is therefore an operator assigned head end decision, while it is a local / hashing decision with IP/MPLS.
IEEE 802.1aq, when faced with a choice between two equal link cost paths, uses the following logic for its first ECMT tie breaking algorithm: first, if one path is shorter than the other in terms of hops, the shorter path is chosen; otherwise, the path with the minimum Bridge Identifier { BridgePriority concatenated with (IS-IS SysID) } is chosen. Other ECMT algorithms are created by simply using known permutations of the BridgePriority||SysIds. For example the second defined ECMT algorithm uses the path with the minimum of the inverse of the BridgeIdentifier and can be thought of as taking the path with the maximum node identifier. For SPBM, each permutation is instantiated as a distinct B-VID. The upper limit of multipath permutations is gated by the number of B-VIDs delegated to 802.1aq operation, a maximum of 4094, although the number of useful path permutations would only require a fraction of the available B-VID space. Fourteen additional ECMT algorithms are defined with different bit masks applied to the BridgeIdentifiers. Since the BridgeIdentifier includes a priority field, it is possible to adjust the ECMT behavior by changing the BridgePriority up or down.
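The tie breaking rule described above can be sketched in Python as follows. This is a simplified illustration and not the normative algorithm: it assumes a 64-bit Bridge Identifier (16-bit BridgePriority followed by a 48-bit SysID), compares paths by hop count first, and then compares the sorted lists of masked Bridge Identifiers so that ties on the smallest identifier fall through to the next smallest. Only the two masks implied by the text (all zeros and all ones) are shown; the function names and the four candidate paths are hypothetical.

```python
def bridge_identifier(priority, sys_id):
    """Concatenate BridgePriority and SysID (assumed 16 and 48 bits wide)."""
    return (priority << 48) | sys_id

LOW_PATH_ID_MASK = 0x0               # first algorithm: minimum Bridge Identifier
HIGH_PATH_ID_MASK = (1 << 64) - 1    # second algorithm: minimum of the inverted identifier

def pick_path(paths, bridge_ids, mask):
    """Pick one of several equal cost paths: fewest hops, then lowest masked identifiers."""
    def key(path):
        return (len(path), sorted(bridge_ids[node] ^ mask for node in path))
    return min(paths, key=key)

# Eight bridges with the same (assumed default) priority and SysIDs 0..7.
ids = {n: bridge_identifier(0x8000, n) for n in range(8)}

# Four hypothetical equal-hop paths between edge bridges 7 and 5.
paths = [[7, 0, 1, 5], [7, 0, 3, 5], [7, 2, 1, 5], [7, 2, 3, 5]]

print(pick_path(paths, ids, LOW_PATH_ID_MASK))    # [7, 0, 1, 5]
print(pick_path(paths, ids, HIGH_PATH_ID_MASK))   # [7, 2, 3, 5]

# Lowering the BridgePriority of bridge 3 attracts the first algorithm to it.
ids[3] = bridge_identifier(0x4000, 3)
print(pick_path(paths, ids, LOW_PATH_ID_MASK))    # [7, 0, 3, 5]
```

Because the key depends only on the set of bridges on a path and not on the direction of travel, the same path is selected from either end, which is what keeps the forward and reverse routes symmetric.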
A service is assigned to a given ECMT B-VID at the edge of the network by configuration. As a result non participating packets associated with that service are encapsulated with the VID associated with the desired ECMT end to end path. All individual and group address traffic associated with this service will therefore use the proper ECMT B-VID and be carried symmetrically end to end on the proper equal cost multi path. Essentially the operator decides which services go in which ECMT paths, unlike a hashing solution used in other systems such as IP/MPLS. Trees can support link aggregation (LAG) groups within a tree "branch" segment where some form of hashing occurs.
This symmetric and end to end ECMT behavior gives IEEE 802.1aq a highly predictable behavior and off line engineering tools can accurately model exact data flows. The behavior is also advantageous to networks where one way delay measurements are important. This is because the one way delay can be accurately computed as 1/2 the round trip delay. Such computations are used by time distribution protocols such as IEEE 1588 for frequency and time of day synchronization as required between precision clock sources and wireless base stations.
Shown below are three figures [5,6,7] which show 8 and 16 ECT behavior in different network topologies. These are composites of screen captures of an 802.1aq network emulator and show the source in purple, the destination in yellow, and then all the computed and available shortest paths in pink. The thicker the line, the more shortest paths use that link. The animations show three different networks and a variety of source and destination pairs which continually change to help visualize what is happening.
The ECT algorithms can be extended almost indefinitely through the use of OPAQUE data, which allows extensions beyond the base 16 algorithms. It is expected that other standards groups or vendors will produce variations on the currently defined algorithms with behaviors suited for different network styles. It is expected that numerous shared tree models will also be defined, as will hop by hop hash based ECMP style behaviors, all defined by a VID and an algorithm that every node agrees to run.
We will work through SPBM behavior on a small example, with emphasis on the shortest path trees for unicast and multicast.
The network shown below [in Figure 1] consists of 8 participating nodes numbered 0 through 7. These would be switches or routers running the IEEE 802.1aq protocol. Each of the 8 participating nodes has a number of adjacencies numbered 1..5. These would likely correspond to interface indexes, or possibly port numbers. Since 802.1aq does not support parallel interfaces each interface corresponds to an adjacency. The port / interface index numbers are of course local and are shown because the output of the computations produce an interface index (in the case of unicast) or a set of interface indexes (in the case of multicast) which are part of the forwarding information base (FIB) together with a destination MAC address and backbone VID.
The network above has a fully meshed inner core of four nodes (0..3) and then four outer nodes (4,5,6 and 7), each dual-homed onto a pair of inner core nodes.
Normally when nodes come from the factory they have a MAC address assigned which becomes a node identifier, but for the purpose of this example we will assume that the nodes have MAC addresses of the form 00:00:00:00:N:00 where N is the node id (0..7) from Figure 1. Therefore node 2 has a MAC address of 00:00:00:00:02:00. Node 2 is connected to node 7 (00:00:00:00:07:00) via interface/5.
The IS-IS protocol runs on all the links shown since they are between participating nodes. The IS-IS hello protocol has a few additions for 802.1aq including information about backbone VIDs to be used by the protocol. We will assume that the operator has chosen to use backbone VIDs 101 and 102 for this instance of 802.1aq on this network.
The nodes will use their MAC addresses as the IS-IS SysId, join a single IS-IS level and exchange link state packets (LSPs in IS-IS terminology). The LSPs will contain node information and link information such that every node will learn the full topology of the network. Since we have not specified any link weights in this example, the IS-IS protocol will pick a default link metric for all links, therefore all routing will be minimum hop count.
After topology discovery the next step is distributed calculation of the unicast routes for both ECMP VIDs and population of the unicast forwarding tables (FIBs).
Consider the route from Node 7 to Node 5: there are a number of equal cost paths. 802.1aq specifies how to choose two of them: the first is referred to as the Low PATH ID path. This is the path which has the minimum node id on it. In this case the Low PATH ID path is the 7->0->1->5 path (as shown in red in Figure 2). Therefore each node on that path will create a forwarding entry toward the MAC address of node five using the first ECMP VID 101. Conversely, 802.1aq specifies a second ECMP tie breaking algorithm called High PATH ID. This is the path with the maximum node identifier on it and in the example is the 7->2->3->5 path (shown in blue in Figure 2).
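The way the two selected paths turn into forwarding entries can be sketched as follows. This is a minimal illustration under stated assumptions: next hops are identified by neighbor node id rather than by the local interface indexes of Figure 1, only the node 7 / node 5 pair is covered, and the helper names `node_mac` and `build_fibs` are hypothetical.

```python
def node_mac(node_id):
    """MAC address convention of this example, e.g. node 2 -> 00:00:00:00:02:00."""
    return f"00:00:00:00:{node_id:02x}:00"

def build_fibs(paths_by_vid):
    """Install (destination MAC, VID) -> next hop entries for both directions of each path."""
    fibs = {}
    for vid, path in paths_by_vid.items():
        for direction in (path, path[::-1]):
            dest_mac = node_mac(direction[-1])
            for node, next_hop in zip(direction, direction[1:]):
                fibs.setdefault(node, {})[(dest_mac, vid)] = next_hop
    return fibs

# Low PATH ID path on VID 101, High PATH ID path on VID 102.
fibs = build_fibs({101: [7, 0, 1, 5], 102: [7, 2, 3, 5]})

for node in (7, 5, 1, 2):
    print("node", node)
    for (mac, vid), next_hop in sorted(fibs[node].items()):
        print(f"  {mac}  VID {vid}  -> next hop node {next_hop}")
```

Because each path is installed in both directions on the same VID, the entries at node 5 mirror those at node 7, and every intermediate node holds one entry toward each end of the path it lies on.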
Node 7 will therefore have a FIB that among other things indicates:
Node 5 will have exactly the inverse in its FIB:
The intermediate nodes will also produce consistent results so for example node 1 will have the following entries.
And Node 2 will have entries as follows:
If we had an attached non participating device at Node 7 talking to a non participating device at Node 5 (for example Device A talks to Device C in Figure 3), they would communicate over one of these shortest paths with a MAC-in-MAC encapsulated frame. The MAC header on any of the NNI links would show an outer source address of 00:00:00:00:07:00, an outer destination address of 00:00:00:00:05:00 and a B-VID of either 101 or 102 depending on which has been chosen for this set of non participating ports/VIDs. The header, once inserted at node 7 when received from Device A, would not change on any of the links until it egressed back to non participating Device C at Node 5. All participating devices would do a simple DA+VID lookup to determine the outgoing interface, and would also check that the incoming interface is the proper next hop for the packet's SA+VID. The addresses of the participating nodes 00:00:00:00:00:00 ... 00:00:00:00:07:00 are never learned but are advertised by IS-IS as the node's SysId.
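The lookup and ingress check performed by a transit node can be sketched as below. This is an illustrative simplification: the FIB maps (B-MAC, B-VID) to a neighbor node id instead of an interface index, the table shown for node 1 is written by hand from the Low PATH ID path 7->0->1->5, and the function name `forward` is hypothetical.

```python
MAC5 = "00:00:00:00:05:00"
MAC7 = "00:00:00:00:07:00"

# Hand-written unicast FIB of transit node 1 for B-VID 101 (neighbor ids as next hops).
NODE1_FIB = {
    (MAC5, 101): 5,   # toward node 5
    (MAC7, 101): 0,   # toward node 7
}

def forward(fib, b_da, b_sa, b_vid, arrived_from):
    """Return the next hop for a frame, or None if the frame must be discarded."""
    # Ingress check: the frame must arrive from the neighbor this node would
    # itself use to reach the source on the same VID.
    if fib.get((b_sa, b_vid)) != arrived_from:
        return None
    # Plain DA+VID lookup; an unknown destination yields None (discard).
    return fib.get((b_da, b_vid))

print(forward(NODE1_FIB, MAC5, MAC7, 101, arrived_from=0))  # 5
print(forward(NODE1_FIB, MAC5, MAC7, 101, arrived_from=3))  # None (unexpected ingress)
```

Since the forward and reverse paths are symmetric, the same table that forwards frames toward a B-MAC also tells the node where frames sourced by that B-MAC are allowed to arrive from.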
Unicast forwarding to a non-participating client address (e.g. A, B, C, D from Figure 3) is of course only possible when the first hop participating node (e.g. 7) is able to know which last hop participating node (e.g. 5) is attached to the desired non participating node (e.g. C). Since this information is not advertised by IEEE 802.1aq it has to be learned. The mechanism for learning is identical to IEEE 802.1ah: in short, the corresponding outer MAC unicast DA, if not known, is replaced by a multicast DA, and when a response is received, the SA of that response tells us the DA to use to reach the non participating node that sourced the response. For example, node 7 learns that C is reached by node 5.
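The edge learning behavior can be sketched with a small table that maps client addresses to backbone addresses. The class name `EdgeLearner`, the placeholder client address "C" and the placeholder group address string are hypothetical and used only for illustration.

```python
class EdgeLearner:
    """Client MAC to B-MAC learning at an edge node for one service (illustrative)."""

    def __init__(self, service_group_da):
        self.service_group_da = service_group_da  # multicast DA used when the target is unknown
        self.c_mac_to_b_mac = {}

    def outer_da(self, client_da):
        """Choose the outer destination: the learned B-MAC, else the service multicast DA."""
        return self.c_mac_to_b_mac.get(client_da, self.service_group_da)

    def learn(self, client_sa, outer_sa):
        """On receipt of an encapsulated frame, remember which B-MAC the client sits behind."""
        self.c_mac_to_b_mac[client_sa] = outer_sa

# Node 7 sending toward client C before and after C's response arrives via node 5.
edge7 = EdgeLearner(service_group_da="<group DA for this service>")
print(edge7.outer_da("C"))                      # falls back to the multicast DA
edge7.learn(client_sa="C", outer_sa="00:00:00:00:05:00")
print(edge7.outer_da("C"))                      # 00:00:00:00:05:00
```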
Since we wish to group/scope sets of non participating ports into services and prevent them from multicasting to each other, IEEE 802.1aq provides mechanism for per source, per service multicast forwarding and defines a special multicast destination address format to provide this. Since the multicast address must uniquely identify the tree, and because there is a tree per source per unique service, the multicast address contains two components, a service component in the low order 24 bits and a network wide unique identifier in the upper 22 bits. Since this is a multicast address the multicast bit is set, and since we are not using the standard OUI space for these manufactured addresses, the Local 'L' bit is set to disambiguate these addresses. In Figure 3 above, this is represented with the DA=[7,O] where the 7 represents packets originating from node 7 and the colored O represents the E-LAN service we are scoped within.
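The structure of such an address can be sketched as follows. The text fixes only the field sizes (a 24-bit service component in the low order bits, a 22-bit source identifier in the upper bits, and the multicast and local bits set); the exact position of the identifier bits within the first octet used here is an assumption for illustration and should not be read as the standard's encoding. The function name `spbm_group_da` is hypothetical.

```python
def spbm_group_da(source_id, i_sid):
    """Build an illustrative per-source, per-service group MAC address."""
    if not 0 <= source_id < (1 << 22):
        raise ValueError("source identifier must fit in 22 bits")
    if not 0 <= i_sid < (1 << 24):
        raise ValueError("service identifier must fit in 24 bits")

    # Assumed layout: top 6 identifier bits share the first octet with the two
    # flag bits (0x01 = multicast / group bit, 0x02 = locally administered bit).
    first = (((source_id >> 16) & 0x3F) << 2) | 0x03
    octets = [
        first,
        (source_id >> 8) & 0xFF,
        source_id & 0xFF,
        (i_sid >> 16) & 0xFF,
        (i_sid >> 8) & 0xFF,
        i_sid & 0xFF,
    ]
    return ":".join(f"{o:02x}" for o in octets)

# Source identifier 7 (standing in for node 7) and service 200.
print(spbm_group_da(7, 200))   # 03:00:07:00:00:c8
```

Any node can derive the same address from the advertised source identifier and service membership, so the group addresses never need to be signaled or learned.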
Prior to creating multicast forwarding for a service, nodes with ports that face that service must be told they are members. For example nodes 7, 4, 5 and 6 are told they are members of the given service, for example service 200, and further that they should be using B-VID 101. This is advertised by IS-IS and all nodes then do the SPBM computation to determine if they are participating either as a head end or tail end, or a tandem point between other head and tail ends in the service. Since node 0 is a tandem between nodes 7 and 5 it creates a forwarding entry for packets from node 7 on this service, to node 5. Likewise, since it is a tandem between nodes 7 and 4 it creates forwarding state from node 7 for packets in this service to node 4. This results in a true multicast entry where the DA/VID have outputs on two interfaces, 1 and 2. Node 2 on the other hand is only on one shortest path in this service and only creates a single forwarding entry from node 7 to node 6 for packets in this service.
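The tandem computation for the tree rooted at node 7 can be sketched as below. The three shortest paths are taken from the description above (7->0->1->5, 7->0->4 and 7->2->6); next hops are expressed as neighbor node ids rather than the interface indexes of the figure, and the function name `multicast_state` is hypothetical.

```python
def multicast_state(source, paths_to_members):
    """Derive per-node replication sets for one (source, service) tree."""
    state = {}
    for path in paths_to_members:
        if path[0] != source:
            raise ValueError("every path must start at the tree's source")
        for node, next_hop in zip(path, path[1:]):
            state.setdefault(node, set()).add(next_hop)
    return state

# Service 200 on B-VID 101: shortest paths from member 7 to members 5, 4 and 6.
paths = [[7, 0, 1, 5], [7, 0, 4], [7, 2, 6]]
state = multicast_state(7, paths)

for node in sorted(state):
    print(f"node {node}: replicate to {sorted(state[node])}")
```

Node 0 ends up with two outputs (toward nodes 1 and 4), which is the true multicast entry described above, while node 2 has a single output toward node 6. Nodes that are not on any of these paths, such as node 3, install no state for this tree.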
Figure 3 only shows a single E-LAN service and only the tree from one of the members, however very large numbers of E-LAN services with membership from 2 to every node in the network can be supported by advertising the membership, computing the tandem behaviors, manufacturing the known multicast addresses and populating the FIBs. The only real limiting factors are the FIB table sizes and computational power of the individual devices both of which are growing yearly in leaps and bounds.
802.1aq does not spread traffic on a hop by hop basis. Instead, 802.1aq allows assignment of an ISID (service) to a VID at the edge of the network. A VID will correspond to exactly one of the possible sets of shortest paths in the network and will never stray from that routing. If there are 10 or so shortest paths between different nodes, it is possible to assign different services to different paths and to know that the traffic for a given service will follow exactly the given path. In this manner traffic can easily be assigned to the desired shortest path. In the event that one of the paths becomes overloaded, it is possible to move some services off those shortest paths by re-assigning the service's ISID to a different, less loaded, VID at the edges of the network.
The deterministic nature of the routing makes offline prediction/computation/experimentation of the network loading much simpler since actual routes are not dependent on the contents of the packet headers with the exception of the VLAN identifier.
Figure 4 shows four different equal cost paths between nodes 7 and 5. An operator can achieve relatively good balance of traffic across the cut between nodes [0 and 2] and [1 and 3] by assigning the services at nodes 7 and 5 to one of the four desired VIDs. Using more than 4 ECT paths in the network will likely allow all 4 of these paths to be used. Balance can also be achieved between nodes 6 and 5 in a similar manner.
In the event that an operator does not wish to manually assign services to shortest paths, it is a simple matter for a switch vendor to hash the ISID to one of the available VIDs to give a degree of non-engineered spreading. For example, the ISID modulo the number of ECT-VIDs could be used to decide on the actual relative VID to use.
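As a minimal sketch of such non-engineered spreading, the following Python function maps an ISID onto one of the configured ECT VIDs by taking the ISID modulo the number of VIDs. The function name `pick_ect_vid` and the example VID values are illustrative assumptions and are not defined by the standard.

```python
def pick_ect_vid(isid, ect_vids):
    """Return the VID at index (isid mod number of ECT VIDs)."""
    if not ect_vids:
        raise ValueError("at least one ECT VID must be configured")
    return ect_vids[isid % len(ect_vids)]


# Hypothetical B-VIDs, one per ECT algorithm in use.
EXAMPLE_VIDS = [101, 102, 103, 104]
```

Because the mapping depends only on the ISID and the ordered VID list, edge nodes that share the same list derive the same VID for a given service.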
In the event that the ECT paths are not sufficiently diverse the operator has the option of adjusting the inputs to the distributed ECT algorithms to apply attraction or repulsion from a given node by adjusting that node's Bridge Priority. This can be experimented with via offline tools until the desired routes are achieved at which point the bias can be applied to the real network and then ISIDs can be moved to the resulting routes.
The animation in Figure 6 shows the diversity available for traffic engineering in a 66 node network. In this animation there are 8 ECT paths available from each highlighted source to destination, and therefore services could be assigned to 8 different pools based on the VID. One such initial assignment in Figure 6 could therefore be (ISID modulo 8), with subsequent fine tuning as required.
Following are three animated GIFs which help to show the behavior of 802.1aq.
The first of these gifs, shown in Figure 5, demonstrates the routing in a 66 node network where we have created a 7 member E-LAN using ISID 100. In this example we show the ECT tree created from each member to reach all of the other members. We cycle through each member to show the full set of trees created for this service. We pause at one point to show the symmetry of routing between two of the nodes and emphasize it with a red line. In each case the source of the tree is highlighted with a small purple V.
The second of these animated gifs, shown in Figure 6, demonstrates 8 ECT paths in the same 66 node network as Figure 5. In each subsequent animated frame the same source is used (in purple) but a different destination is shown (in yellow). For each frame, all of the shortest paths between the source and destination are shown superimposed. When two shortest paths traverse the same hop, the thickness of the lines being drawn is increased. In addition to the 66 node network, a small multi-level Data Center style network is also shown, with sources and destinations both within the servers (at the bottom) and from servers to the router layer at the top. This animation helps to show the diversity of the ECT being produced.
The last of these animated gifs, shown in Figure 7, demonstrates source destination ECT paths using all 16 of the standard algorithms currently defined.
802.1aq takes IS-IS topology information augmented with service attachment (I-SID) information, does a series of computations and produces a forwarding table (filtering table) for unicast and multicast entries.
The IS-IS extensions that carry the information required by 802.1aq are given in the isis-layer2 IETF document listed below.
An implementation of 802.1aq will first modify the IS-IS hellos to include an NLPID (network layer protocol identifier) of 0xC1, which has been reserved for 802.1aq, in their Protocols-Supported TLV (type 129). The hellos must also include an MSTID (which gives the purpose of each VID), and finally each ECMT behavior must be assigned to a VID and exchanged in the hellos. The hellos would normally run untagged. Note that the NLPID for IP is not required to form an adjacency for 802.1aq, but its presence will not prevent an adjacency either.
The links are assigned 802.1aq specific metrics which travel in their own TLV which is more or less identical to the IP link metrics. The calculations will always use the minimum of the two unidirectional link metrics to enforce symmetric route weights.
The node is assigned a MAC address to identify it globally and this is used to form the IS-IS SYSID. A box MAC would normally serve this purpose. The Area-Id is not directly used by 802.1aq but should of course be the same for nodes in the same 802.1aq network. Multiple areas/levels are not yet supported.
The node is further assigned an SPSourceID which is a 20 bit network wide unique identifier. This can often be the low 20 bits of the SYSID (if unique) or can be dynamically negotiated or manually configured.
The SPSourceID and the ECMT assignments to B-VIDs are then advertised into the IS-IS network in their own 802.1aq TLV.
The 802.1aq computations are restricted to links between nodes that have an 802.1aq link weight and which support the NLPID 0xC1. As previously discussed, the link weights are forced to be symmetric for the purpose of computation by taking the minimum of two dissimilar values.
When a service is configured in the form of an I-SID assignment to an ECMT behavior that I-SID is then advertised along with the desired ECMT behavior and an indication of its transmit, receive properties (a new TLV is used for this purpose of course).
When an 802.1aq node receives an IS-IS update it will compute the unique shortest path to all other IS-IS nodes that support 802.1aq. There will be one unique (symmetric) shortest path per ECMT behavior. The tie breaking used to enforce this uniqueness and ECMT is described below.
The unicast FDB/FIB will be populated based on this first shortest path computation. There will be one entry per ECMT behavior/B-VID produced.
The transit multicast computation (which only applies when transit replication is desired and is not applicable to services that have chosen head end replication) can be implemented in many ways. Care must be taken to keep this efficient, but in general a series of shortest path computations must be done. The basic requirement is to decide 'am I on the shortest path between two nodes, one of which transmits an I-SID and the other receives that I-SID?'
Rather poor performing pseudo-code for this computation looks something like this:
for each NODE in network which originates at least one transmit isid do {
    SPF = compute the shortest path trees from NODE for all ECMT B-VIDs.
    for each ECMT behavior {
        for each NEIGHBOR of NODE {
            if NEIGHBOR is on the SPF towards NODE for this ECMT {
                T = NODE's transmit ISID's unioned with all receive ISIDs below us on SPF
                for each ISID in T {
                    create/modify multicast entry where [
                        MAC-DA = NODE.SpsourceID:20||ISID:24||LocalBit:1||MulticastBit:1
                        B-VID = VID associated with this ECMT
                        out port = interface to NEIGHBOR
                        in port = port towards NODE on the SPF for this ECMT
                    ]
                }
            }
        }
    }
}
The above pseudo-code computes many more SPFs than strictly necessary in most cases, and better algorithms are known to decide if a node is on a shortest path between two other nodes. A reference to a paper presented at the IEEE which gives a much faster algorithm that drastically reduces the number of outer iterations required is given below.
In general, though, even the exhaustive algorithm above, when carefully crafted, is more than able to handle networks of several hundred nodes in a few tens of milliseconds on common CPUs running at 1 GHz or faster.
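The question of whether a node lies on a shortest path between two others can be illustrated with a standard distance identity: a node n is on at least one shortest path from a to b exactly when d(a,n) + d(n,b) = d(a,b). The Python sketch below applies that identity using Dijkstra's algorithm. It is not the faster algorithm from the IEEE paper, it assumes symmetric link metrics and a connected graph, and it ignores ECMT tie-breaking, so it only identifies candidates among equal-cost paths. The function names and the example topology are hypothetical.

```python
import heapq


def dijkstra(graph, source):
    """Return the least cost from source to every reachable node."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, weight in graph[u].items():
            nd = d + weight
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def on_some_shortest_path(graph, node, src, dst):
    """True if node lies on at least one least-cost path from src to dst."""
    from_src = dijkstra(graph, src)
    from_node = dijkstra(graph, node)
    return from_src[node] + from_node[dst] == from_src[dst]


# Hypothetical topology with unit link metrics (adjacency map: node -> {neighbor: metric}).
EXAMPLE_GRAPH = {
    7: {0: 1, 2: 1},
    0: {7: 1, 1: 1, 3: 1},
    2: {7: 1, 1: 1, 3: 1, 6: 1},
    1: {0: 1, 2: 1, 5: 1},
    3: {0: 1, 2: 1, 5: 1},
    5: {1: 1, 3: 1},
    6: {2: 1},
}
```

In this assumed topology, node 0 is on a shortest path from node 7 to node 5, whereas node 6 is not.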
For ISIDs that have chosen head end replication the computation is trivial and involves simply finding the other attachment points that receive that ISID and creating a serial unicast table to replicate to them one by one.
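The head end replication case can be sketched in a few lines of Python. The function below assumes the receive membership learned from IS-IS is already available as a mapping from ISID to the set of nodes that receive it; that data structure and the name `head_end_targets` are illustrative assumptions.

```python
def head_end_targets(local_node, isid, receivers_by_isid):
    """Return the remote nodes to which local_node unicasts copies for isid."""
    receivers = receivers_by_isid.get(isid, set())
    # The local node never replicates to itself.
    return sorted(node for node in receivers if node != local_node)
```

For a service 200 received at nodes 7, 4, 5 and 6, node 7 would replicate serially to nodes 4, 5 and 6.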
The first public interoperability tests of IEEE 802.1aq were held in Ottawa in October 2010. Two vendors provided SPBM implementations and a total of 5 physical switches and 32 emulated switches were tested for control/data and OA&M. The interop report slides are linked below. The most important aspects are the hardware forwarding at line rate support by the vendors and the full OA&M including L2 ping and L2 traceroute.[6]
A second, private interoperability test was conducted in Ottawa in January 2011. This involved 9 switches and focused on the data plane and on testing VM motion over a multi-vendor network. Virtual machines were successfully moved over the multi-vendor network.
The third interoperability test, a public one, was conducted in Ottawa in June 2011. This involved 5 vendors and 6 implementations. 10 physical switches were used and a 187 node network was formed by including 2 different simulators. A novel network viewer was also tested which formed a single adjacency, learned the network and drew the network and its status in real time. The interop report slides used to present the summary of this work to the various standards bodies are linked below.[7]
802.1aq must produce deterministic symmetric downstream congruent shortest paths. This means that not only must a given node compute the same path forward and reverse but all the other nodes downstream (and upstream) on that path must also produce the same result. This downstream congruence is a consequence of the hop by hop forwarding nature of Ethernet since only the destination address and VID are used to decide the next hop. It is important to keep this in mind when trying to design other ECMT algorithms for 802.1aq as this is an easy trap to fall into.
We start by taking the unidirectional link metrics that are advertised by IS-IS for 802.1aq and ensuring that they are symmetric. This is done by simply taking the MIN of the two values at both ends prior to doing any computations. This alone does not guarantee symmetry, however.
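The symmetrization step can be sketched as follows. The input format (a mapping from directed node pairs to advertised metrics) and the function name are assumptions made for illustration. Links advertised in only one direction are skipped here, which is also an assumption, based on the restriction of the computation to links that carry an 802.1aq weight.

```python
def symmetric_metrics(advertised):
    """Map each bidirectional link to the minimum of its two advertised metrics.

    advertised maps (from_node, to_node) -> metric.
    The result maps frozenset({a, b}) -> symmetric metric.
    """
    result = {}
    for (a, b), metric in advertised.items():
        reverse = advertised.get((b, a))
        if reverse is None:
            continue  # assumption: one-way advertisements are ignored
        result[frozenset((a, b))] = min(metric, reverse)
    return result
```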
The 802.1aq standard describes a mechanism called a PATHID which is a network-wide unique identifier for a path. This is a useful logical way to understand how to deterministically break ties but is not how one would implement such a tie-breaker in practice. The PATHID is defined as just the sorted sequence of SYSIDs that make up the path (not including the end points). Every path in the network therefore has a unique PATHID independent of where in the network the path is discovered.
802.1aq simply always picks the lowest PATHID path when a choice presents itself in the shortest path computations. This ensures that every node will make the same decision.
For example in Figure 7 above, there are four equal-cost paths between node 7 and node 5 as shown by the colors blue, green, pink and brown. The PATHID of each path is the sorted list of the intermediate nodes it traverses.
The lowest PATHID is therefore the brown path {0,1}.
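The logical definition of the PATHID translates directly into a short Python sketch. The text gives only the PATHIDs {0,1} and {2,3} for this example; the node sequences for the other two paths below are assumed for illustration, and the function names are hypothetical.

```python
def path_id(path):
    """PATHID of a path: the sorted SYSIDs of its intermediate nodes."""
    return tuple(sorted(path[1:-1]))


def lowest_path(paths):
    """Pick the path with the lowest PATHID (tuples compare element by element)."""
    return min(paths, key=path_id)


# Assumed node sequences for four equal-cost paths from node 7 to node 5.
EXAMPLE_PATHS = [
    [7, 2, 3, 5],
    [7, 2, 1, 5],
    [7, 0, 3, 5],
    [7, 0, 1, 5],
]
```

With these assumed paths, `lowest_path(EXAMPLE_PATHS)` selects the path through nodes 0 and 1, matching the {0,1} result above. As the text notes, a real implementation would not enumerate whole paths in this way.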
This low PATHID algorithm has very desirable properties. The first is that it can be done progressively by simply looking for the lowest SYSID along a path. The second is that an efficient stepwise implementation is possible by simply back-tracking two competing paths and taking the minimum of the two paths' minimum SYSIDs.
The low PATHID algorithm is the basis of all 802.1aq tie breaking. ECMT is also based on the low PATHID algorithm by simply feeding it different SYSID permutations, one per ECMT algorithm. The most obvious permutation to pass is a complete inversion of the SYSID by XOR-ing it with 0xfff... prior to looking for the min of two minimums. This algorithm is referred to as high PATHID because it logically chooses the largest PATHID path when presented with two equal-cost choices.
In the example in figure 7, the path with the highest PATHID is therefore the blue path whose PATHID is {2,3}. Simply inverting all the SYSIDs and running the low PATHID algorithm will yield the same result.
The other 14 defined ECMT algorithms use different permutations of the SYSID by XOR-ing it with different bit masks which are designed to create a relatively good distribution of bits. It should be clear that different permutations will result in the pink and green paths being lowest in turn.
The 17 individual 64-bit masks used by the ECT algorithm are made up of the same byte value repeated eight times to fill each 64-bit mask. These 17 byte values are as follows:
ECT-MASK[17] = { 0x00, 0x00, 0xFF, 0x88, 0x77, 0x44, 0x33, 0xCC, 0xBB, 0x22, 0x11, 0x66, 0x55, 0xAA, 0x99, 0xDD, 0xEE };
ECT-MASK[0] is reserved for a common spanning tree algorithm, while ECT-MASK[1] creates the Low PATHID set of shortest path first trees, ECT-MASK[2] creates the High PATHID set of shortest path trees and the other indexes create other relatively diverse permutations of shortest path first trees.
In addition, the ECMT tie-breaking algorithms also permit some degree of human override or tweaking. This is accomplished by including a BridgePriority field together with the SYSID such that the combination, called a BridgeIdentifier, becomes the input to the ECT algorithm. By adjusting the BridgePriority up or down, a path's PATHID can be raised or lowered relative to others and a substantial degree of tunability is afforded.
The above description gives an easy to understand way to view the tie breaking; an actual implementation simply backtracks from the fork point to the join point in two competing equal-cost paths (usually during the Dijkstra shortest path computation) and picks the path traversing the lowest (after masking) BridgePriority|SysId.
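The masked comparison described above can be sketched in Python as follows. The sketch assumes that a BridgeIdentifier is a 16-bit BridgePriority concatenated with a 48-bit SYSID, and that the two competing segments between the fork and join points are already known as non-empty lists of BridgeIdentifiers; the function names are hypothetical.

```python
ECT_MASK = [0x00, 0x00, 0xFF, 0x88, 0x77, 0x44, 0x33, 0xCC, 0xBB,
            0x22, 0x11, 0x66, 0x55, 0xAA, 0x99, 0xDD, 0xEE]


def bridge_identifier(priority, sysid):
    """Concatenate a 16-bit BridgePriority with a 48-bit SYSID (assumed layout)."""
    return ((priority & 0xFFFF) << 48) | (sysid & 0xFFFFFFFFFFFF)


def expand_mask(mask_byte):
    """Repeat one mask byte eight times to form a 64-bit mask."""
    return int.from_bytes(bytes([mask_byte]) * 8, "big")


def pick_segment(segment_a, segment_b, ect_index):
    """Return the segment whose lowest masked BridgeIdentifier is smaller.

    Both segments must be non-empty and share no bridges, which holds for
    the portions of two competing paths between their fork and join points.
    """
    mask = expand_mask(ECT_MASK[ect_index])
    low_a = min(bid ^ mask for bid in segment_a)
    low_b = min(bid ^ mask for bid in segment_b)
    return segment_a if low_a < low_b else segment_b
```

With index 1 (mask 0x00) a segment through bridges 0 and 1 wins over one through bridges 2 and 3, and with index 2 (mask 0xFF) the choice is reversed, which corresponds to the low and high PATHID behaviors. Raising the BridgePriority of the bridges on one segment makes that segment less attractive under the low PATHID algorithm.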